Writing Quality, Knowledge, and Comprehension Correlates of Human and Automated Essay Scoring
Authors
Abstract
Automated essay scoring tools are often criticized on the basis of construct validity. Specifically, it has been argued that computational scoring algorithms may not be aligned with higher-level indicators of quality writing, such as writers’ demonstrated knowledge and understanding of the essay topics. In this paper, we consider how and whether the scoring algorithms within an intelligent writing tutor correlate with measures of writing proficiency and students’ general knowledge, reading comprehension, and vocabulary skill. Results indicate that the computational algorithms, although less attuned to knowledge and comprehension factors than human raters, were marginally related to such variables. Implications for improving automated scoring and intelligent tutoring of writing are briefly discussed.

Automated Essay Scoring

Automated writing evaluation (AWE) uses computational tools to grade and give feedback on student writing (Shermis & Burstein, 2003). Studies have reported scoring reliability and accuracy in terms of significant, positive correlations between human and automated ratings, high rates of perfect agreement (i.e., exact matches between human and automated scores), and adjacent agreement (e.g., human and automated scores within one point on a 6-point scale). AWE has been touted as a means to efficiently score large numbers of essays and to enable classroom teachers to offer more writing assignments (Shermis & Burstein, 2003). Several systems now offer scoring, feedback, and class management tools, such as Criterion (Attali & Burstein, 2006), WriteToLearn (Landauer, Lochbaum, & Dooley, 2009), and MyAccess (Grimes & Warschauer, 2010). AWE tools gather large amounts of data about texts, such as structure, word use, syntax, cohesion, and semantic similarity (e.g., Landauer, McNamara, Dennis, & Kintsch, 2007; McNamara, Graesser, McCarthy, & Cai, in press; Shermis & Burstein, 2003). Importantly, students’ ability to produce text that is technically correct (e.g., spelling), structured (e.g., paragraphs), and lexically proficient (e.g., use of rare words) is related to writing quality (Deane, 2013). Thus, scoring tools can and do leverage the links between linguistic features and overall writing quality to generate scores that are reliably similar to human ratings.

Despite such computational power, AWE necessarily excludes aspects of writing that are difficult to detect automatically, which tend to correspond to higher-level issues (e.g., comprehension). As a result, the proliferation of AWE has met with justifiable criticism, with perhaps the strongest objections pertaining to construct validity (Anson et al., 2013; Condon, 2013; Deane, 2013). A common concern is that AWE tools are not able to assess the most meaningful aspects of good and poor writing, such as writers’ demonstrated knowledge and understanding of the essay topic, the persuasiveness of arguments, or an engaging style. In short, many argue that automated scoring fails “to measure meaningfulness of content, argumentation quality, or rhetorical effectiveness” because AWE systems “do not measure the full writing construct, but rather, a restricted construct” (Deane, 2013, p. 16). These deficits may be further exhibited in the feedback that AWE systems can give to developing writers.
If a system cannot detect higher-level features of writing, then that system also cannot provide intelligent, personalized assistance on those same features to struggling students. To begin exploring the issue of construct validity in automated scoring, this paper uses available data to examine how and whether human and automated scores for prompt-based essays are correlated with students’ knowledge and the related literacy skills of reading comprehension and vocabulary. Although this is not an exhaustive list of the qualities of good writers, knowledge and literacy skills offer a meaningful point of departure. Proficient writers not only adhere to rules of grammar and spelling but also display skillful use of their knowledge and understanding of the topic (Chesky & Hiebert, 1987; McCutchen, 2000). Similarly, students’ ability to write comprehensibly is related to their skills in comprehending text and vocabulary.
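To make the agreement statistics described above concrete, the following Python sketch (illustrative only; the function name, the 6-point rubric, and the score values are hypothetical and not taken from this study) computes the Pearson correlation, the proportion of perfect (exact) agreement, and the proportion of adjacent agreement between a set of human and automated essay scores.

import numpy as np
from scipy.stats import pearsonr

def agreement_metrics(human, automated, adjacent_window=1):
    # Compare human and automated scores rated on the same scale.
    human = np.asarray(human, dtype=float)
    automated = np.asarray(automated, dtype=float)
    diff = np.abs(human - automated)
    r, p = pearsonr(human, automated)                    # linear association between raters
    exact = float(np.mean(diff == 0))                    # perfect agreement (exact matches)
    adjacent = float(np.mean(diff <= adjacent_window))   # adjacent agreement (within one point by default)
    return {"pearson_r": r, "p_value": p,
            "exact_agreement": exact, "adjacent_agreement": adjacent}

# Hypothetical scores on a 6-point rubric, for illustration only.
human_scores = [4, 3, 5, 2, 4, 6, 3, 5]
awe_scores = [4, 4, 5, 2, 3, 5, 3, 5]
print(agreement_metrics(human_scores, awe_scores))

The same pearsonr call could, in principle, be used to relate either set of essay scores to external measures such as vocabulary, reading comprehension, or general knowledge test scores, which is the kind of analysis pursued in this paper.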
Similar Papers
Toward Evaluation of Writing Style: Finding Overly Repetitive Word Use in Student Essays
Automated essay scoring is now an established capability used from elementary school through graduate school for purposes of instruction and assessment. Newer applications provide automated diagnostic feedback about student writing. Feedback includes errors in grammar, usage, and mechanics, comments about writing style, and evaluation of discourse structure. This paper reports on a system that ...
Investigating neural architectures for short answer scoring
Neural approaches to automated essay scoring have recently shown state-of-the-art performance. The automated essay scoring task typically involves a broad notion of writing quality that encompasses content, grammar, organization, and conventions. This differs from the short answer content scoring task, which focuses on content accuracy. The inputs to neural essay scoring models – n-grams and embe...
The application of network automated essay scoring system in college English writing course
With the innovation of computer technology and the application of new network technologies, many auxiliary network teaching and learning software programs related to English teaching have emerged. The automated essay scoring system is a revolutionary network innovation; in recent years, it has made the reform of the English essay teaching mode more feasible. Based on the network automated ess...
Incorporating learning characteristics into automatic essay scoring models: What individual differences and linguistic features tell us about writing quality
This study investigates a novel approach to automatically assessing essay quality that combines natural language processing approaches that assess text features with approaches that assess individual differences in writers such as demographic information, standardized test scores, and survey results. The results demonstrate that combining text features and individual differences increases the a...
Evidence for the Interpretation and Use of Scores from an Automated Essay Scorer
This paper examined validity evidence for the scores based on the Intelligent Essay Assessor (IEA), an automated essay-scoring engine developed by Pearson Knowledge Technologies. A study was carried out using the validity framework described by Yang et al. (2002). This framework delineates three approaches to validation studies: examine the relationship among scores given to the same essays by...